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Mammoth DNA 


Hendrik N. Poinar,’2?* Carsten Schwarz," Ji Qi, Beth Shapiro,” Ross D. E. MacPhee,° 
Bernard Buigues,’ Alexei Tikhonov,® Daniel H. Huson,” Lynn P. Tomsho,’ Alexander Auch,” 
Markus Rampp,?° Webb Miller,* Stephan C. Schuster** 


We sequenced 28 million base pairs of DNA in a metagenomics approach, using a woolly 
mammoth (Mammuthus primigenius) sample from Siberia. As a result of exceptional sample 
preservation and the use of a recently developed emulsion polymerase chain reaction and 
pyrosequencing technique, 13 million base pairs (45.4%) of the sequencing reads were identified 
as mammoth DNA. Sequence identity between our data and African elephant (Loxodonta africana) 
was 98.55%, consistent with a paleontologically based divergence date of 5 to 6 million years. 
The sample includes a surprisingly small diversity of environmental DNAs. The high percentage 
of endogenous DNA recoverable from this single mammoth would allow for completion of its 


genome, unleashing the field of paleogenomics. 


omplete genome sequences of extinct 
( species will answer long-standing ques- 

tions in molecular evolution and allow 
us to tackle the molecular basis of speciation, 
temporal stages of gene evolution, and inter- 
mediates of selection during domestication. To 
date, fossil remains have yielded little genetic 
insight into evolutionary processes because of 
poor preservation of their DNA and our limited 
ability to retrieve nuclear DNA (nDNA). 
Most DNA extracted from fossil remains is 
truncated into fragments of very short length 
[<300 base pairs (bp)] from hydrolysis of the 
DNA backbone, cross-linking due to conden- 
sation (/, 2), and oxidation of pyrimidines 
(3), which prevents extension by Tag DNA 
polymerase during polymerase chain reaction 
(PCR). In addition, DNA extracts are a mixture 
of bacterial, fungal, and often human contam- 
inants, complicating the isolation of endoge- 
nous DNA. In the past, these problems could 
only be indirectly overcome by concentrating 
on the small number of genes present on the 
maternally inherited mitochondrial genome, 
which is present in high copy number in ani- 
mal cells. This approach severely limits access 
to the storehouses of genetic information po- 
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tentially available in fossils of now-extinct spe- 
cies. In a few rare cases, investigators have 
managed to isolate and characterize nuclear 
DNA from fossil remains preserved in arid cave 
deposits (4—6) or, more commonly, permafrost- 
dominated environments (7, 8) and ice (9), 
where the average burial temperature can be as 
low as —10°C (J0). Under these conditions, 
preservation is enhanced by reduced reaction 
rates: In permafrost settings, theoretical cal- 
culations predict DNA fragment survival up to 
1 million years (11, 12). 

Although more nuclear DNA is present in cells 
than mitochondrial DNA, access to the nuclear 
genome even in well-preserved fossil material re- 
mains difficult with PCR-based approaches, 
which must target known sequences from specific 
genes. To find the ideal sample and analytical 
approach for paleogenomics, we screened eight of 
the morphologically best preserved mammoth 
remains in the collections of the Mammoth 
Museum, a dedicated permafrost “ice cave” 
facility in the town of Khatanga in the southeastern 
part of the Taimyr Peninsula, Russian Federation, 
and maintained by Cerpolex/Mammuthus Expe- 
ditions (CME). We extracted DNA from these 
samples, using ancient DNA methodology, to 
avoid to the greatest extent possible exogenous 
DNA (13), because we are well aware of the 
problems and pitfalls associated with potential 
contamination. These samples were screened 
with a quantitative PCR (qPCR) assay designed 
for the mammoth mitochondrial cytochrome b 
gene (/4), with primers designed to amplify 
mitochondrial DNA molecules from both African 
and Asian elephants (Loxodonta africana and 
Elephas maximus). We quantitated the number of 
amplifiable mitochondrial DNA (mtDNA) mole- 
cules of 84 bp in length. The eight samples 
ranged from 1 x 10° copies per gram to 96 x 
10° copies per gram, all excellent samples as 
judged by ancient DNA standards (/5). How- 
ever, one sample in particular, CME 2005/915, 
was exceptional. The specimen, an edentulous 


mandible dated to 27,740 + 220 '4C years before 
the present (uncorrected; Beta 210777), was 
recovered on the shore of Baikura-turku, a large 
bay on the southeastern side of Lake Taimyr, the 
largest freshwater body in Eurasia north of the 
Arctic Circle. The large numbers and high 
quality of late Quaternary fossils recovered 
from the Taimyr Peninsula have prompted sev- 
eral major investigations in recent years (/6), 
including studies of ancient DNA (14, 17, 18). 
Taimyr’s extremely cold winters, combined with 
short, cool summers and little annual precipita- 
tion, have ensured that conditions optimal for the 
preservation of bones and teeth prevailed there 
for most or all of the late Pleistocene. 

To obtain a better perspective on the preser- 
vational integrity of this sample, we quantitated 
the number of amplifiable mtDNA cytochrome b 
gene fragments of increasing length up to 1 Kb 
(fig. S1) (73). The DNA extract (from 58 mg 
bone) contained ~7 x 10° copies of the 84-bp 
mtDNA fragment in 100 ul or 121 x 10° copies 
per gram of bone (/3). On the basis of this, we 
estimated that the concentration of total am- 
plifiable DNA in our sample (assuming a copy- 
number ratio of 1000:1, mtDNA:nDNA) was on 
the order of 0.73 ug of mammoth DNA per gram 
of bone. We extracted 1 g of bone and concen- 
trated the DNA to 100 ul, which was subsequently 
used for library construction and sequencing tech- 
nology that recently became available (73, 19). 
This in vitro technique circumvents amplification 
or cloning biases by compartmentalizing single 
DNA molecules before the amplification step in a 
lipid vesicle, thereby maintaining the original 
DNA template distribution. The lipid-enclosed 
single DNA molecule, attached to a sepharose 
bead 28 um in diameter, undergoes a PCR 
reaction, yielding sufficient DNA copies for 
sequencing. Sequencing is performed with a 
pyrosequencing methodology (19). 

We obtained 302,692 sequence reads aver- 
aging 95 bp, with read length being limited by the 
sequencing approach rather than by the 630-bp 
average DNA fragment size obtained from the 
extract after shearing. The total sequence data 
produced for the bone metagenome was 28 x 
10° bp. We aligned the sequencing reads with 
current (November 2005) assemblies of the ge- 
nome sequences of African elephant (L. africana), 
human, and dog (Canis familiaris), downloaded 
from www.genome.ucsc.edu (Table 1 and Fig. 1). 
Alignments were computed by the program 
BLASTZ (20), with parameters chosen to iden- 
tify only regions of about 90% identity or higher 
(13). As we have not detected any convincing 
mappings of a read to the human Y chromosome, 
despite random distribution across the genome, 
we conclude that our mammoth was a female. 

A total of 137,527, or 45.4% of all reads, 
aligned to the African elephant genome (Fig. 1, 
Table 1) (73), currently available at 2.2-fold 
coverage, with an estimated number of base 
pairs in the genome of 2.3 x 10° bp (www. 
broad.mit.edu/mammals/#chart). A twofold cov- 
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Table 1. Total percent of aligned reads and their relative identities to African elephant, human, 


and dog. 


Elephant 


Human Dog 


Total no. reads 

Aligned reads 

Uniquely aligning reads 

Multiply aligned reads 

Reads with at least 95% identity 
Reads with 100% identity 


302,692 (100%) 
137,527 (45.4%) 
44,442 (14.7%) 
93,085 (30.8%) 
90,507 (30.0%) 
21,952 (7.3%) 


302,692 (100%) 
4,237 (1.4%) 
3,901 (1.3%) 

336 (0.1%) 
1,184 (0.4%) 
116 (0.04%) 


302,692 (100%) 
3,775 (1.2%) 
3,548 (1.2%) 

227 (0.1%) 
1,140 (0.4%) 
142 (0.05%) 


Uniquely aligning base pairs 4,332,350 318,966 291,714 
Identity in unique alignments 98.55% 92.68% 92.91% 
Mitochondrial reads 209 = - 
Identity in mitochondrial reads 95.93% - = 
Mitochondrial base pairs 16,419 - - 
Fig. 1. Characterization of E Mammoth 


the mammoth metagenomic 
library, including percent- 
age of read distributions to 
various taxa. Host organism 
prediction based on BLASTZ 
comparison against GenBank 
and environmental sequences 
database. 


9.09% 


E Unidentified 
sequences 
18.42% 


H Environmental 


(predicted to align) 


Mammoth 
(alignable 
to elephant) 
45.43% 


sequences 
14.15% OVirus 
0.09% [O Alignable to human 
HOther eukaroyta 1.40% 
4.15% BArchaea E Bacteria — HAlignable to dog 
0.24% 5.76% 1.25% 


erage approximates only 80% of the total ge- 
nome, so a conservative estimate is that half of 
our reads would align to a completed elephant 
sequence. Among all reads, 44,442 (14.7%) 
aligned to only one position in the elephant 
genome, and 21,952 (7.3%) exhibited a per- 
fect (100%) match, up to a read length of 132 bp. 
To test whether the observed hits were more 
likely to be derived from endogenous mam- 
moth DNA, as opposed to potential contam- 
inants such as human DNA, we repeated the 
BLASTZ analyses as above, this time compar- 
ing our sequence reads to the currently avail- 
able versions of the human and dog genomes. 
Only 4237 reads (1.4%) aligned to human and 
3775 (1.2%) to dog (at our threshold of ap- 
proximately 90% identity). Between 1% and 
5% of any two distantly related mammalian 
genomes should align at 90% identity or greater, 
because roughly 0.5% of these genomes con- 
sist of protein-coding segments conserved at 
that level (27), and noncoding DNA contributes 
a somewhat larger fraction (22). Thus, the frac- 
tion of our reads that show at least 90% identity 
with human and dog is what is expected if only 
mammoth DNA were sequenced. 

To further assess the possibility of contami- 
nation in our DNA sample, we explicitly 
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considered what would be expected if contam- 
inating human DNA sequences were present. If 
some of our reads were human DNA, then these 
reads should align with nearly 100% identity to 
human, and at most, 5% of these could be ex- 
pected to align at 90% identity or higher to any 
nonprimate mammalian genome (only about 5 
to 6% of the genome appears to be under neg- 
ative selection (23), and our 90% threshold is far 
above the neutral level). If our data contained 
human reads, we could be essentially certain that a 
large fraction of them would align to human at or 
above 97% identity over at least 80% of their 
length and not align to either elephant or dog. Only 
14 reads satisfy these criteria. Thus, there are, at 
most, trace levels of human contamination in the 
sample or the process, and the 14 potential exam- 
ples may simply be regions that are well conserved 
between mammoth and all other mammals but that 
happen to be missing from the current assemblies 
of the dog and elephant genomes. We applied the 
same approach to look for potential contamination 
of dog DNA in our sample and found none. 

To determine substitution patterns between 
mammoth and other species (table S2), we used 
the subset of reads that aligned to only one po- 
sition. We observed 98.55% identity between 
mammoth and elephant. As this number does 
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not correct for alterations of the sequence due to 
damage caused by base decomposition, we are 
likely underestimating the amount of sequence 
similarity. Base damage in fossil matrices can 
result in a myriad of base changes; however, 
the most commonly observed change has been 
deamination of cytosine to uracil (24, 25) re- 
sulting in C-to-T and G-to-A substitutions. 
We therefore looked for asymmetries in base 
composition between modern and ancient ge- 
nomes (table S2). We reasoned that C-to-T 
transitions due to DNA damage would manifest 
as an excess of elephant C aligned to mam- 
moth T over elephant T aligned to mammoth C. 
Indeed, the ratio of the rates to and from mam- 
moth relative to elephant is 1.91 (15297/7990) 
for C-to-T and 1.15 (9530/8252) for G-to-A 
[C-to-T human 0.96 (3849/4027); dog 1.11 
(3622/3276); G-to-A human 0.87 (3541/4060); 
and dog 0.97 (3321/3440)]. Thus, we find no- 
ticeable deamination of cytosines in our ex- 
tract. In addition, we analyzed the frequencies 
with which substitutions likely to be attrib- 
utable to postmortem damage occurred in the 
amplified fragments of mtDNA by compar- 
ison with the publicly available mammoth 
mitochondrion genome (GenBank accession 
#NC_007596, DQ188829) (13). We found 222 
reads that aligned to the public GenBank mam- 
moth mitochondrial genome (Fig. 2). Two hun- 
dred nine reads gave a total of 18,581 bp, 7617 
bp of which were overlapping, resulting in a 
total coverage of 10,964 bp of a possible 
16,770 bp (65%). One hundred fourteen of the 
209 reads (55%) matched the previously pub- 
lished sequence exactly. One hundred nine 
base differences were observed between the 
reads and the published sequence, 49 of which 
were not supported by overlapping reads. The 
majority of these substitutions (84%) were ei- 
ther C-to-T or G-to-A transitions, as is expected 
if the substitutions were due to postmortem 
DNA deamination. The remaining 13 reads 
differed significantly from the published data 
and may be evidence of potential nuclear 
inserts of DNA from the mitochondrion (/3), 
which have been reported previously to be com- 
mon in elephants and mammoths (26). 

These findings are well within the predicted 
levels of damage for ancient DNA and dem- 
onstrate the feasibility and benefits of ancient 
whole-genome sequencing without previous am- 
plification, as overlapping reads from a multifold 
coverage would easily correct for decomposi- 
tional base changes accrued during the sample’s 
depositional history and serve as a DNA damage 
correction filter. The ratio of mtDNA to nuclear 
DNA for our sample was 1:658, which agrees 
with what one would expect given a 1:1000 
copy-number ratio for nDNA versus mtDNA. 

Despite the presence in our sample of an 
exceptionally high percentage (54.5%, including 
reads predicted to align to elephant) of mam- 
moth DNA, relative to environmental contam- 
inants, 45.5% of the total DNA derives from 
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Fig. 2. Distribution of 209 
reads along the mam- 
moth mitochondrial ge- 
nome (GenBank accession 
#NC_007596, DQ188829). 
Average fragment length 
was 89 bp; not shown to 
scale. To determine whether 
the reads were randomly 
distributed along the ge- 
nome, we compared the 
distribution of fragment 
lengths that would result 
from cutting the genome 
at the 5’ end of each read 
against that resulting from 
100 million randomly gen- 
erated distributions of 209 
reads. Despite multiple 
overlapping regions, the 
real distribution was not 
significantly different from 
the empirical distribution 
of fragment lengths (P = 
0.069). 


endogenous bacteria and nonelephantid envi- 
ronmental contaminants. In addition to ubiqui- 
tous contaminants resulting from handling or 
conditions of storage, these exogenous species 
are likely to represent taxa present at or im- 
mediately after the time of the mammoth’s 
death, thereby contributing to the decomposition 
of the remains. To acquire a glimpse of the 
biodiversity of these communities, we have de- 
vised software (GenomeTaxonomyBrowser) 
(27, 28) that allows for the taxonomic iden- 
tification of various species on the basis of 
sequence comparison and current phylogenetic 
classification at the National Center for Biotech- 
nology Information (NCBI) taxonomy browser 
as of November 2005 (www.ncbi.nlm.nih.gov/ 
Taxonomy/taxonomyhome.html). We compared 
302,692 reads (100%) against the nonredundant 
(nr) and the environmental database (env_nr). 
Using an adjustable factor for bitscore (13), we 
classified the reads simultaneously to the indi- 
vidual kingdom, phylum, class, order, family, 
and genus down to the species level wherever 
possible (fig. S2 and table S3), excluding all hits 
matching Gnathostomata (jawed vertebrates). 
The remaining 12,563 hits within the Eu- 
karyota (4.15%) were only surpassed by the 
number of bacterial hits, 17,425 (5.76%), and 
hits against the environmental database, 42,816 
(14.15%). The kingdom Archaea was hit in- 
frequently with only 736 hits (0.24%). Within 
this group, the Euryarchaeota dominated the 
Crenarchaeota by a ratio of 16:1. In the bacte- 
rial superkingdom, the most prevalent species 
were found to be proteobacteria, 5282 (1.75%); 
Firmicutes (gram-positives), 940 (0.31%), most- 
ly Bacilli and Chlostridia; Actinobacteria, 2740 
(0.91%); Bacteroidetes, 497 (0.16%); and the 
group of the Chlorobi bacteria, 248 (0.08%). Oth- 
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Mammoth mtDNA 


Ml Ribosomal RNA 
HE Coding genes 


(Gs Transfer RNA 
E J Non-coding 


er identified microorganisms included the fun- 
gal taxa Ashbya, Aspergillus, and Neurospora/ 
Magnaporte with 440 hits (0.14%). We also 
found 278 hits against viral sequences, which 
could be assigned to dsDNA viruses, 193 hits 
(0.06%); retrotranscribing viruses, 20 hits 
(0.01%); and ssRNA viruses, 46 hits (0.02%). 

The soil-inhabiting eukaryotic species, 
Dictyosteliida, with 127 hits (0.04%), and 
Entamoeba, with 64 hits (0.02%), were found 
to be underrepresented, as were hits against 
the two nematode genomes, 277 hits (0.09%). 
A detailed identification of plant species is hand- 
icapped, because presently only the two plant 
genomes, Arabidopsis thaliana (mouse-ear cress) 
and Oryza sativa (rice), are publicly availa- 
ble. We found the hits against grass species to 
outnumber the ones from Brassicales by a ratio 
of 3:1, which could be indicative of ancient 
pastures on which the mammoth is believed 
to have grazed. 

From this classification, it is evident that 
nonvertebrate eukaryotic and prokaryotic spe- 
cies occur at approximately equal ratios, with 
the mammalian fraction dominating the iden- 
tifiable fraction of the metagenome. The pau- 
city of fungal species is surprising, as is the 
low number of reads from nematodes. 

Recently, a whole-genome approach was 
attempted from DNA of the extinct cave bear 
Ursus spelaeus, yielding ~27,000 bp of endog- 
enous genetic material from 1.1 to 5.8% of all 
DNA reads (29). We have produced 13 million 
bp of endogenous genetic material from 45% of 
all DNA reads, some 480 times as much DNA 
sequence and 15 times the percentage. The abil- 
ity to obtain this level of genetic information 
from extinct species makes it possible to con- 
sider detailed analysis of functional genes and 


fine-scale refinement of mutation rates. A rapid 
identification assay of single nucleotide poly- 
morphisms (SNPs) would be of great value 
for studying population genetics of Pleistocene 
mammals and plants, which in turn could help 
elucidate their responses to climate changes 
during late glacial and early postglacial time 
and ultimately shed new light on the cause and 
consequences of late Quaternary extinctions. 
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